A Compressed Self-Index for Genomic Databases

Authors

  • Travis Gagie
  • Juha Kärkkäinen
  • Yakov Nekrich
  • Simon J. Puglisi
Abstract

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
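The RLZ idea described in the abstract can be illustrated with a minimal sketch: each later genome is parsed greedily into phrases, and every phrase is a (position, length) pair pointing into the first (reference) genome. The names `rlz_parse` and `rlz_decode` are hypothetical and chosen only for this illustration. The naive `str.find` search stands in for the FM-index mentioned in the abstract, and the fallback to a literal character for symbols absent from the reference is an assumption of this sketch, not something the abstract specifies.

```python
from typing import List, Tuple, Union

# A phrase is either (start in reference, length) or a single literal character.
Phrase = Union[Tuple[int, int], str]


def rlz_parse(reference: str, target: str) -> List[Phrase]:
    """Greedily parse `target` into phrases copied from `reference` only."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        length = 1
        # Grow the candidate phrase one character at a time until it no
        # longer occurs anywhere in the reference (naive, quadratic search).
        while i + length <= len(target):
            pos = reference.find(target[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            # Assumption: a character missing from the reference is kept as a literal.
            phrases.append(target[i])
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference: str, phrases: List[Phrase]) -> str:
    """Rebuild the target by copying each phrase out of the reference."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)
```

Because two genomes of the same species are nearly identical, such a parse tends to produce a small number of long phrases, which is where the compression comes from. Decoding touches only the reference and the phrase list, which is consistent with the fast random access the abstract mentions.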


Similar resources

Self-Indexing Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...


Self-Index Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...


Combining Text Compression and String Matching: The Miracle of Self-Indexing

This decade has witnessed the rise of what I consider the most important breakthrough of modern times in text compression and indexed string matching. Self-indexing is the mechanism by which a text is simultaneously compressed and indexed, so that the self-index occupies space close to that of the compressed text, provides random access to any part of it, and in addition supports efficient inde...


A Faster Grammar-Based Self-index

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(r + z log log n) space such that, given a pattern P[1..m], we can list the occ occurrences ...


Application of Fractal Codes as Similarity Measure for Compressed Image Databases

In image database applications, it is desirable that functions such as searching, browsing, and partial recall be done without totally decompressing the images. Using wavelet-compressed images is becoming increasingly popular. Image databases, and edge images derived from such compressed images can be viewed as indexes that can be queried by examples. In this research, a fractional code generat...



Journal:
  • CoRR

Volume abs/1111.1355

Publication date: 2011